Justin's comments for improving this demo

  • Your code is very, very repetitious. Your code is not meant as a simple one-time project; we are writing code that is to be maintained by future collaborators of this project, for years to come. Please keep this in mind while you're writing code.

  • While your demo app looked nice and polished, when I took a second look at this document, the app was not well documented at all. I spent a full day looking at this script and trying to improve it (and demonstrate the improvement, in terms of speed and readablility), but I'm not close to being done. This shows:

  1. How slow/inefficient your code was (e.g. many repeated cbinds and rbinds and for(i in 1:100){ a = c(a, length(i))}), and

  2. How difficult it is to read this document. (e.g. your code blocks were explained with partial sentences; at some point, you just wrote two incomplete sentences, not meant for anyone else but yourself:

- (single dataframe)
-  easier to handle using plot_ly() and highlight_key()
  1. a lot of copy pasting of code; for instance, column names were set by colnames(pie_table)<-c("Time","cluster_1","cluster_2","cluster_3","cluster_4","cluster_5","cluster_6","cluster_7","cluster_8","cluster_9","cluster_10") instead of colnames(pie_table) <- c("Time", paste0("cluster_", 1:10)). Surely, if you've taken coding class you are capable of the latter.
  • You had some comments embedded in the code, which is great, but they were not comprehensive enough for me to know what was going on.

  • I've mentioned this several times, but your code should be modularized into functions whenever possible. I can give you an example of (fake) readable code that I would have expected to see something like:

filedir = "~/Downloads/for-henry/res.RDS"
res = readRDS(filedir)
pie_table = form_pie_table(res)
all_2d_tables = form_2d_tables(res)
plist = lapply(all_2d_tables, plot_2d_table())
subplot(plist[[1]], plist[[2]], plist[[3]],
        shareX = TRUE, shareY = TRUE)

Then, if I want to know how form_pie_table() works, I can simply look into the function definition, something like:

##' Function to form a table of cluster probabilities.
##' 
##' @param res Object of class "flowmix"
##'
##' @return Matrix of size (TT x 5), whose columns are ..
form_pie_table <- function(res){
  res
}
  • The long for-loop for producing the three tables (e.g. diam_mid_chl_small_table) should be rewritten to be more efficient and readable. I would start with the following steps (1-4):
  1. First prescribe to yourself what the end product needs to be e.g. three tables each containing count data from two axes at a time, as well as the x and y values of the points to plot.
  2. Make and call a function for parts of the code that are repeated. A lot of code is simply copy-pasted here (the whole code block starting with ## Temporary exists three times) -- you should avoid this.
  3. Try to eliminate as many duplicated operations as you can. You've clearly copy-pasted identical code multiple times. You should avoid this.
  4. Try to avoid the repeated concatenations e.g. diam_mid_chl_small_table <- rbind(diam_mid_chl_small_table,y) This makes the code very slow.
  • (I've in fact done a lot of this already; see my new code block labeled "2d-tables" which replaces your long for-loop. I've used a function called collapse_3d_to_2d(y, counts, dims=1:2) from my flowmix R package, just so you know.

  • As much as possible, don't hard-code anything (use numclust instead of 10 for number of clusters, dimdat instead of 3 for dimension of data, TT instead of 296, etc.)

  • In the following block of code, instead of the loop, you could just do this: time = sapply(ylist, nrow). This is faster and shorter. (I understand why you might do this if you are coming from a C background, but in R, for-loops + concatenating is very slow, since R has to make new copies and allocate new memory of the object for every iteration of the for loop; see https://stackoverflow.com/questions/14693956/how-can-i-prevent-rbind-from-geting-really-slow-as-dataframe-grows-larger for an entertaining discussion about this).

time <- c()
for(i in ylist){
  time <- c(time,nrow(i))
}
time_column <- rep(Time, time)
  • Your object names should be more descriptive. You have time and Times and TT; it's not clear which is what. In fact, time is the number of rows of each element in ylist, and not related to time at all.

  • Just to be clear, in all of my code, I'm using TT to encode the number of time points (e.g. TT == length(ylist)), and tt to mean a specific time point e.g. for(tt in 1:TT){ ...}. It would be great if we could keep this a convention.

  • Now that you know how to program in R, you should definitely read this style guide http://adv-r.had.co.nz/Style.html (old but short and useful version) and https://style.tidyverse.org/syntax.html (newer and longer version) and adhere rigorously. For future reference, Hadley Wickham is a statistician who is famous for improving statistical software (e.g. R, tidyverse, dplyr); his advice found on the web is usually sound and useful.

  • This document is clearly not knittable. That's, in general, the whole point of using R Markdown -- you have a single ".Rmd" file that you can knit to reproduce your data analysis. I should be able to hit "knit" (Ctrl+Shift+K) in my Rstudio and be able to produce a complete document; the same one as yours. From hereon, when you share Rmd documents, could you make sure that these are completely knittable.

  • The code doesn't work as intended -- my new code can't reproduce your p1 -- which is the obvious first thing to fix.

  • If you look at the code block called animation1, it seems that the background scatterplot points (not the cluster means) are not designed to move over time (I could be wrong -- let me know if I am). You only use the first 500 columns of the large data frame. What happens when you use more rows, or all rows? Is this too slow, or prohibitive memory-wise? If so, doesn't that mean that this approach either needs to be improved or replaced?

The rest of this document (up till before "The rest of the code") is my attempt at cleaning while trying to understand what you did. Please study it carefully, compare it to your old code (feel free to ask any questions), address all of the points I've made, and make a more complete demo.Rmd and share it with me by the next time we meet.

Load Data

The file res.RDS will be loaded into a list called res.

## filedir = "C:/Users/Administrator/Documents/GitHub/flowcy-shiny/learning-plotly/res.RDS"
filedir = "~/Downloads/for-henry/res.RDS"
res = readRDS(filedir)
countslist = res$countslist
ylist = res$ylist
mn = res$mn
numclust = res$numclust
TT = length(ylist)
Time <- names(ylist)

Read data

Some points:

  1. Elements from res will stored into three dataframes.
  2. There are 296 unique time points in the data.
  3. The response (ylist) will be collapsed into two dimensions at a time, and will be stored into 3 dataframes called diam_chl_table, pe_chl_table, and diam_pe_table.

First, setting up the data:

## Form `pie_table`
pie_table <- data.frame(res$pie)
pie_table <- setDT(pie_table, keep.rownames = "Time")[]
colnames(pie_table) <- c("Time", paste0("cluster_", 1:10))

# Reshape pie_table into a dataframe with 3 columns:
# Time : time corresponding to this row
# Cluster : Which cluster is it; an integer from 1 to 10
# Prob : the probability of this cluster at this time
pie_table <- pie_table%>%
  gather(var,val,-c(1))%>%
  separate(var,c("type","cluster"))%>%
  .[-2]
colnames(pie_table) <- c("Time", "Cluster","Prob")

## Now, combining the **cluster mean** information with the **cluster
## probability**, in `pie_table`. (Every row of `pie_table` is a unique (time,
## cluster, point); SH: is this correct?)
dfs = lapply(1:numclust, function(iclust){
  one_table = tibble(diam_mid_pie = mn[,1,iclust],
                     chl_small_pie = mn[,1,iclust],
                     pe_pie = mn[,1,iclust],
                     Time = Time, Cluster = rep(iclust, TT))
  colnames(one_table)[1:3] <- c("diam_mid_pie","chl_small_pie","pe_pie")
  one_table
})
mn_table = bind_rows(dfs)
pie_table = merge(pie_table, mn_table)
## Form a y_table
y_table = do.call(rbind, ylist) %>% as.data.frame()
reptimes = sapply(ylist, nrow)
time_column <- rep(Time, reptimes)
y_table$Time <- time_column
time_justin_code = microbenchmark::microbenchmark({

  ## Setup
  alltimes = names(ylist)
  TT = length(ylist)
  tablist = list()
  dimslist = list(c(1,2), c(2,3), c(3,1))

  ## Obtain all the tables once.
  for(ii in 1:length(dimslist)){

    dims = dimslist[[ii]]

    ## Form the table
    all_2d_tables = lapply(1:TT, function(tt){
      y = ylist[[tt]][,dims]
      counts = countslist[[tt]]
      collapse_3d_to_2d(y, counts, dims=1:2)
    })
    combined_2d_table = do.call(rbind, all_2d_tables) %>% as_tibble()

    ## Add time
    reptimes = sapply(all_2d_tables, nrow)
    times = rep(alltimes, reptimes)
    combined_2d_table[,"times"] = times
    newname = paste0("counts_", ii)
    combined_2d_table <- rename(combined_2d_table, !!newname:=counts)
    tablist[[ii]] = combined_2d_table
  }
  diam_chl_table <- tablist[[1]]
  pe_chl_table <- tablist[[2]]
  diam_pe_table <- tablist[[3]] 
}, times = 1)

## The above code is 5-6 times faster than your old code, and is easier to read.

##       min        lq      mean    median        uq       max neval
##  5.837105  5.837105  5.837105  5.837105  5.837105  5.837105     1
## 32.042188 32.042188 32.042188 32.042188 32.042188 32.042188     1

Animation of three scatterplots

  • (single dataframe)
  • easier to handle using plot_ly() and highlight_key()
# To have a quickier view; Delete this line on final presentation
# generate three plots

# Highlight key 
key <- highlight_key(pie_table,~Cluster)

p1 <- plot_ly(diam_chl_table[1:500,],
              x = ~diam_mid,
              y = ~chl_small,
              size = ~ counts_1,
              type = 'scatter',
              mode = 'markers',sizes = c(0,1000), opacity = 0.5,alpha = 0.3,
              showlegend = FALSE) %>%
  add_trace(data = key,
            x = ~diam_mid_pie,
            y = ~chl_small_pie,
            frame = ~Time,
            size = 5,
            hoverinfo = 'text',
            text = ~paste("Cluster", Cluster),
            hovertext = ~paste("Cluster", Cluster),
            textposition = "bottom center",
            mode = 'markers+text',
            textfont = list(size = 14),
            opacity = 0.9)
p1

The rest of the code

p2 and p3

p2 <- plot_ly(pe_chl_small_table[1:500,],
              y = ~pe,
              x = ~chl,
              size = ~ counts_2,
              type = 'scatter', mode = 'markers',alpha = 0.3,
              opacity=0.5,sizes = c(0,1000),showlegend = FALSE)%>%
  add_trace(data = key,
            x = ~chl_small_pie,
            y = ~pe_pie,
            frame = ~Time,
            showlegend = FALSE, 
            size = 5,
            hoverinfo = 'text',
            text = ~paste("Cluster",Cluster),
            hovertext = ~paste("Cluster",Cluster),
            textposition = "bottom center",
            mode = 'markers+text',
            textfont = list(size = 14),
            opacity = 0.9
            )

p3<-plot_ly(diam_mid_pe_table[1:500,],
            x= ~pe,
            y = ~diam_mid,
            size = ~ counts_3,
            type = 'scatter',
            mode = 'markers',
            opacity=0.5, sizes = c(0,1000),alpha = 0.3, showlegend = FALSE)%>%
  add_trace(data = key,
            x = ~pe_pie,
            y = ~diam_mid_pie,
            frame = ~Time,
            size = 5,
            hoverinfo = 'text',
            text = ~paste("Cluster",Cluster),
            hovertext = ~paste("Cluster",Cluster),
            textposition = "bottom center",
            mode = 'markers+text',
            textfont = list(size = 14),
            opacity = 0.9
            )
scatterplot <- subplot(p1,p2,p3,shareX = TRUE,shareY = TRUE)
scatterplot

Animation of the covariates table

# Colors
myColor = c("red","green","orange","blue","brown","deeppink","navyblue","orchid","seagreen","gold")


# create highlight key, which highlights by cluster
key <- highlight_key(pie_table,~Cluster)
pie_plot <- plot_ly(key)%>%
  group_by(Cluster)%>%
  add_lines(x = ~Time , y = ~Prob, color = ~Cluster, colors = myColor ,)%>%
  add_segments(x = ~Time, xend = ~Time, y = 0, yend = 1, frame = ~Time,showlegend = FALSE, color = "red",showlegend = FALSE)

# modify the x-axis  
pie_plot <- pie_plot%>%layout(
  xaxis = list(
    type = "date",
    tickformat = "%d%M%Y",showticklabels = FALSE)
)
pie_plot

Combine two plots

# generate three plots
 subplot(scatterplot,pie_plot,nrows = 2)